
    Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions

    We consider the problem of adaptive stratified sampling for Monte Carlo integration of a differentiable function, given a finite number of evaluations of the function. We construct a sampling scheme that samples more often in regions where the function oscillates more, while allocating the samples so that they are well spread over the domain (a notion similar to low discrepancy). We prove that the estimate returned by the algorithm is almost as accurate as the estimate that an optimal oracle strategy (one that would know the variations of the function everywhere) would return, and we provide a finite-sample analysis. (23 pages, 3 figures, to appear in the NIPS 2012 conference proceedings.)
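    As a rough illustration of the idea (not the paper's algorithm), the Python sketch below runs a two-phase stratified Monte Carlo estimator on [0, 1]: a small pilot budget estimates the variability inside each stratum, and the remaining budget is allocated roughly proportionally to it, so regions where the integrand oscillates more receive more samples. All function names and tuning constants are hypothetical.

```python
import numpy as np

def adaptive_stratified_mc(f, n_budget=2000, n_strata=20, n_pilot=5, seed=0):
    """Two-phase adaptive stratified Monte Carlo integration on [0, 1].

    Phase 1 spends a small pilot budget per stratum to estimate the local
    standard deviation; phase 2 allocates the remaining budget roughly
    proportionally to those estimates (Neyman-style allocation), so
    oscillating regions get more samples.
    """
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_strata + 1)
    widths = np.diff(edges)

    # Phase 1: pilot samples to estimate per-stratum variability.
    stds = np.empty(n_strata)
    means = np.empty(n_strata)
    samples = []
    for k in range(n_strata):
        x = rng.uniform(edges[k], edges[k + 1], size=n_pilot)
        y = f(x)
        samples.append(y)
        means[k] = y.mean()
        stds[k] = y.std() + 1e-12  # avoid a zero allocation weight

    # Phase 2: allocate the remaining budget proportionally to width_k * std_k.
    remaining = n_budget - n_strata * n_pilot
    weights = widths * stds
    alloc = np.floor(remaining * weights / weights.sum()).astype(int)
    for k in range(n_strata):
        if alloc[k] > 0:
            x = rng.uniform(edges[k], edges[k + 1], size=alloc[k])
            samples[k] = np.concatenate([samples[k], f(x)])
            means[k] = samples[k].mean()

    # Stratified estimate: per-stratum means weighted by stratum widths.
    return float(np.sum(widths * means))

if __name__ == "__main__":
    f = lambda x: np.sin(20 * x) + x          # test integrand with a known integral
    est = adaptive_stratified_mc(f)
    exact = (1 - np.cos(20)) / 20 + 0.5
    print(f"estimate={est:.5f}  exact={exact:.5f}")
```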

    Bandit Theory meets Compressed Sensing for high dimensional Stochastic Linear Bandit

    We consider a linear stochastic bandit problem where the dimension $K$ of the unknown parameter $\theta$ is larger than the sampling budget $n$. In such cases, it is in general impossible to derive sub-linear regret bounds, since usual linear bandit algorithms have a regret in $O(K\sqrt{n})$. In this paper we assume that $\theta$ is $S$-sparse, i.e. has at most $S$ non-zero components, and that the space of arms is the unit ball for the $||\cdot||_2$ norm. We combine ideas from Compressed Sensing and Bandit Theory and derive algorithms with regret bounds in $O(S\sqrt{n})$.
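    The sketch below is a toy explore-then-commit illustration of the sparsity idea, not the algorithm from the paper: a short random-exploration phase produces a hard-thresholded least-squares estimate of the sparse parameter (the compressed-sensing ingredient), after which the forecaster commits to the unit-norm arm aligned with that estimate. The function name, the `s_hat` sparsity guess, and the budget split are assumptions made for the example.

```python
import numpy as np

def sparse_linear_bandit_etc(theta, n=2000, n_explore=400, s_hat=5, noise=0.1, seed=0):
    """Explore-then-commit sketch for a sparse linear bandit on the unit ball.

    Not the paper's algorithm: only a toy illustration of combining a
    sparsity-exploiting (hard-thresholded least squares) estimate from a
    short exploration phase with commitment to the arm pointing along the
    estimated parameter.
    """
    rng = np.random.default_rng(seed)
    d = theta.shape[0]
    best_reward = np.linalg.norm(theta)   # optimal arm on the unit ball is theta / ||theta||

    # Exploration: pull random unit-norm arms, record noisy linear rewards.
    X = rng.standard_normal((n_explore, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = X @ theta + noise * rng.standard_normal(n_explore)
    regret = n_explore * best_reward - float((X @ theta).sum())

    # Compressed-sensing flavour: ridge estimate, then keep the s_hat largest
    # coordinates (hard thresholding) as the recovered sparse support.
    theta_hat = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ y)
    support = np.argsort(np.abs(theta_hat))[-s_hat:]
    theta_sparse = np.zeros(d)
    theta_sparse[support] = theta_hat[support]

    # Commit: play the unit-norm arm aligned with the sparse estimate.
    arm = theta_sparse / (np.linalg.norm(theta_sparse) + 1e-12)
    regret += (n - n_explore) * (best_reward - float(arm @ theta))
    return regret

if __name__ == "__main__":
    K, S = 100, 3
    theta = np.zeros(K)
    theta[:S] = [0.8, 0.5, 0.3]
    print("toy regret:", sparse_linear_bandit_etc(theta))
```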

    Bandit Algorithms for Tree Search

    Bandit-based methods for tree search have recently gained popularity when applied to huge trees, e.g. in the game of Go (Gelly et al., 2006). The UCT algorithm (Kocsis and Szepesvari, 2006), a tree search method based on Upper Confidence Bounds (UCB) (Auer et al., 2002), is believed to adapt locally to the effective smoothness of the tree. However, we show that UCT is too "optimistic" in some cases, leading to a regret of O(exp(exp(D))), where D is the depth of the tree. We propose alternative bandit algorithms for tree search. First, a modification of UCT using a confidence sequence that scales exponentially with the horizon depth is proven to have a regret of O(2^D \sqrt{n}), but it does not adapt to possible smoothness in the tree. We then analyze Flat-UCB performed on the leaves and provide a finite regret bound that holds with high probability. Next, we introduce a UCB-based Bandit Algorithm for Smooth Trees which takes into account the actual smoothness of the rewards to perform efficient "cuts" of sub-optimal branches with high confidence. Finally, we present an incremental tree search version which applies when the full tree is too big (possibly infinite) to be entirely represented, and we show that, with high probability, essentially only the optimal branches are developed indefinitely. We illustrate these methods on a global optimization problem for a Lipschitz function, given noisy data.
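    For concreteness, here is a minimal UCT-style loop on a toy full binary tree with Bernoulli rewards at the leaves. It uses the plain UCB1 bonus rather than the depth-dependent confidence sequences discussed above, and the reward structure is an invented toy problem, so it should be read as a sketch of the selection/backup mechanics only.

```python
import math
import random

def uct_binary_tree(depth=6, n_iters=5000, c=1.0, seed=0):
    """Minimal UCT sketch on a full binary tree with Bernoulli rewards at leaves."""
    rng = random.Random(seed)
    n_leaves = 2 ** depth
    leaf_means = [rng.random() for _ in range(n_leaves)]   # hypothetical toy rewards

    # Per-node statistics keyed by (depth, index): visit count and reward sum.
    counts, sums = {}, {}

    def select_child(d, i):
        """UCB1 choice between children (d+1, 2i) and (d+1, 2i+1)."""
        total = sum(counts.get((d + 1, 2 * i + b), 0) for b in (0, 1))
        best, best_score = 0, -float("inf")
        for b in (0, 1):
            key = (d + 1, 2 * i + b)
            nb = counts.get(key, 0)
            if nb == 0:
                return b                      # expand unvisited children first
            score = sums[key] / nb + c * math.sqrt(math.log(total) / nb)
            if score > best_score:
                best, best_score = b, score
        return best

    for _ in range(n_iters):
        d, i, path = 0, 0, [(0, 0)]
        while d < depth:                      # descend to a leaf
            b = select_child(d, i)
            d, i = d + 1, 2 * i + b
            path.append((d, i))
        reward = 1.0 if rng.random() < leaf_means[i] else 0.0
        for key in path:                      # back up the reward along the path
            counts[key] = counts.get(key, 0) + 1
            sums[key] = sums.get(key, 0.0) + reward

    best_leaf = max(range(n_leaves), key=lambda j: counts.get((depth, j), 0))
    print(f"most visited leaf mean={leaf_means[best_leaf]:.3f}, "
          f"best mean={max(leaf_means):.3f}")

if __name__ == "__main__":
    uct_binary_tree()
```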

    Performance Bounds in $L_p$-norm for Approximate Value Iteration

    Approximate Value Iteration (AVI) is a method for solving large Markov Decision Problems by approximating the optimal value function with a sequence of value function representations $V_n$ processed according to the iterations $V_{n+1} = \mathcal{A}\mathcal{T}V_n$, where $\mathcal{T}$ is the so-called Bellman operator and $\mathcal{A}$ an approximation operator, which may be implemented by a Supervised Learning (SL) algorithm. Usual bounds on the asymptotic performance of AVI are established in terms of the $L_\infty$-norm approximation errors induced by the SL algorithm. However, most widely used SL algorithms (such as least squares regression) return a function (the best fit) that minimizes an empirical approximation error in $L_p$-norm ($p \geq 1$). In this paper, we extend the performance bounds of AVI to weighted $L_p$-norms, which makes it possible to directly relate the performance of AVI to the approximation power of the SL algorithm, hence ensuring the tightness and practical relevance of these bounds. The main result is a performance bound on the resulting policies expressed in terms of the $L_p$-norm errors introduced by the successive approximations. The new bound takes into account a concentration coefficient that estimates how much the discounted future-state distributions, starting from a probability measure used to assess the performance of AVI, can possibly differ from the distribution used in the regression operation. We illustrate the tightness of the bounds on an optimal replacement problem.
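    A minimal sketch of the iteration $V_{n+1} = \mathcal{A}\mathcal{T}V_n$ on a small random MDP is given below: the Bellman operator $\mathcal{T}$ is applied exactly (tabular toy case) and the approximation operator $\mathcal{A}$ is an ordinary L2 least-squares fit onto a few polynomial features, i.e. precisely the kind of $L_p$-minimizing supervised learner the bounds target. The MDP, the features, and the constants are illustrative assumptions.

```python
import numpy as np

def fitted_value_iteration(n_states=50, n_actions=2, gamma=0.9, n_iters=60, seed=0):
    """Sketch of Approximate Value Iteration V_{n+1} = A T V_n on a random toy MDP."""
    rng = np.random.default_rng(seed)
    # Random transition kernels P[a] (row-stochastic) and rewards R[a, s].
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((n_actions, n_states))

    # Polynomial features of the (normalized) state index.
    s = np.linspace(0.0, 1.0, n_states)
    Phi = np.stack([np.ones_like(s), s, s**2, s**3], axis=1)

    def bellman(V):
        """Exact Bellman optimality operator T on the tabular MDP."""
        return np.max(R + gamma * (P @ V), axis=0)

    V = np.zeros(n_states)
    for _ in range(n_iters):
        target = bellman(V)                                   # T V_n
        w, *_ = np.linalg.lstsq(Phi, target, rcond=None)      # A: L2 least-squares projection
        V = Phi @ w                                           # V_{n+1} = A T V_n

    # Compare against (near-)exact value iteration to see the residual error.
    V_star = np.zeros(n_states)
    for _ in range(2000):
        V_star = bellman(V_star)
    for p in (2, np.inf):
        err = np.linalg.norm(V - V_star, ord=p) / (n_states ** (0 if p == np.inf else 1 / p))
        print(f"L_{p} error of the AVI fixed point: {err:.4f}")

if __name__ == "__main__":
    fitted_value_iteration()
```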

    Optimistic optimization of a deterministic function without the knowledge of its smoothness

    We consider the global optimization of a deterministic function f in a semi-metric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of the semi-metric. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semi-metric under which f is smooth, and whose performance is almost as good as that of DOO optimally fitted.
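    The following sketch implements the DOO side of the story on [0, 1] with dyadic partitioning: each cell is scored by its centre value plus a decreasing bound delta(depth) (the semi-metric knowledge DOO needs, here a hypothetical geometric choice), and the most optimistic cell is expanded next. SOO, which dispenses with delta, is not shown.

```python
import heapq
import math

def doo_maximize(f, n_evals=200, delta=lambda h: 2.0 * 2 ** (-h)):
    """Sketch of DOO (Deterministic Optimistic Optimization) on [0, 1].

    Cells are dyadic intervals; a cell's optimistic value is f(center) plus
    delta(depth), an assumed bound on how much f can rise inside the cell.
    At every step the cell with the largest optimistic value is expanded.
    """
    lo, hi = 0.0, 1.0
    fc = f(0.5 * (lo + hi))
    # Max-heap of cells stored as (-optimistic_value, depth, low, high, f(center)).
    heap = [(-(fc + delta(0)), 0, lo, hi, fc)]
    best_x, best_f = 0.5 * (lo + hi), fc
    evals = 1

    while evals < n_evals and heap:
        _, h, lo, hi, _ = heapq.heappop(heap)
        mid = 0.5 * (lo + hi)
        # Split the most optimistic cell into two children and evaluate their centres.
        for child_lo, child_hi in ((lo, mid), (mid, hi)):
            if evals >= n_evals:
                break
            c = 0.5 * (child_lo + child_hi)
            fc = f(c)
            evals += 1
            if fc > best_f:
                best_x, best_f = c, fc
            heapq.heappush(heap, (-(fc + delta(h + 1)), h + 1, child_lo, child_hi, fc))

    return best_x, best_f

if __name__ == "__main__":
    f = lambda x: math.sin(13 * x) * math.sin(27 * x)   # multimodal test function
    x, fx = doo_maximize(f)
    print(f"best x={x:.4f}, f(x)={fx:.4f}")
```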

    Analyse en norme $L_p$ de l'algorithme d'itérations sur les valeurs avec approximations

    Approximate Value Iteration (AVI) is a method for solving a large Markov Decision Problem by approximating the optimal value function with a sequence of value representations $V_n$ processed by means of the iterations $V_{n+1} = \mathcal{A}\mathcal{T}V_n$, where $\mathcal{T}$ is the so-called Bellman operator and $\mathcal{A}$ an approximation operator, which may be implemented by a Supervised Learning (SL) algorithm. Previous results relate the asymptotic performance of AVI to the $L_\infty$-norm of the approximation errors induced by the SL algorithm. Unfortunately, SL algorithms usually solve a minimization problem in $L_p$-norm ($p \geq 1$), rendering the $L_\infty$ performance bounds inadequate. In this paper, we extend these performance bounds to weighted $L_p$-norms. This makes it possible to relate the performance of AVI to the approximation power of the SL algorithm, which guarantees the tightness and practical interest of these bounds. We numerically illustrate the tightness of the bounds on an optimal replacement problem.

    Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

    We study a variance reduction technique for the Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sampled trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) up to a threshold that depends on the approximation error V - AV, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. AV = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is, computing the sensitivity ∂_α V of the performance measure V with respect to some parameter α of the transition probabilities. For example, in parametric policy optimization, an estimate of the policy gradient is required to perform a gradient-based optimization method. We show that, using two approximations, one for the value function and one for the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both representations.
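    To make the control-variate mechanism concrete, the sketch below estimates V(s0) on a small known-kernel chain and subtracts the zero-mean terms gamma^{t+1}(W(X_{t+1}) - (PW)(X_t)) built from a fixed approximation W of V. This is a single fixed control variate rather than the sequential scheme of the paper, but it already shows the variance collapsing when W is close to V. The chain, horizon, and perturbation level are arbitrary choices for the illustration.

```python
import numpy as np

def value_estimate_with_control_variate(n_states=10, gamma=0.9, horizon=60,
                                        n_traj=2000, seed=0):
    """Control-variate sketch for value estimation in a Markov chain."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_states)); P /= P.sum(axis=1, keepdims=True)
    r = rng.random(n_states)

    # Exact value function (for reference) and a perturbed approximation W.
    V = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    W = V + 0.05 * rng.standard_normal(n_states)   # "good but imperfect" W
    PW = P @ W

    plain, corrected = [], []
    for _ in range(n_traj):
        s, ret, cv = 0, 0.0, 0.0
        for t in range(horizon):
            ret += gamma ** t * r[s]
            s_next = rng.choice(n_states, p=P[s])
            # Zero-mean correction term: its conditional expectation given X_t is 0.
            cv += gamma ** (t + 1) * (W[s_next] - PW[s])
            s = s_next
        plain.append(ret)
        corrected.append(ret - cv)

    print(f"exact V(0)      = {V[0]:.4f}")
    print(f"plain MC        = {np.mean(plain):.4f}  (std {np.std(plain):.4f})")
    print(f"control variate = {np.mean(corrected):.4f}  (std {np.std(corrected):.4f})")

if __name__ == "__main__":
    value_estimate_with_control_variate()
```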

    Pure Exploration for Multi-Armed Bandit Problems

    We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case where the cumulative regret is considered and exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations where the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.
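    The toy experiment below makes the simple-regret criterion concrete: two forecasters (uniform allocation and UCB1) each spend n rounds on a Bernoulli bandit and then recommend the empirically best arm, and we report the mean simple regret of the recommendation. The arm means, budget, and number of runs are arbitrary; the code only illustrates the evaluation protocol, not the bounds discussed above.

```python
import numpy as np

def simple_regret_experiment(means=(0.5, 0.45, 0.4, 0.3), n=2000, n_runs=200, seed=0):
    """Compare the simple regret of a uniform-allocation and a UCB1 forecaster."""
    rng = np.random.default_rng(seed)
    means = np.array(means)
    K, best = len(means), means.max()

    def run(policy):
        counts, sums = np.zeros(K), np.zeros(K)
        for t in range(n):
            if policy == "uniform":
                a = t % K                      # pure exploration: round-robin pulls
            else:                              # UCB1: exploit while exploring
                if t < K:
                    a = t
                else:
                    a = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
            sums[a] += rng.random() < means[a]   # Bernoulli reward
            counts[a] += 1
        recommended = int(np.argmax(sums / counts))
        return best - means[recommended]         # simple regret of the recommendation

    for policy in ("uniform", "ucb"):
        sr = np.mean([run(policy) for _ in range(n_runs)])
        print(f"{policy:8s} mean simple regret: {sr:.4f}")

if __name__ == "__main__":
    simple_regret_experiment()
```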

    Best-Arm Identification in Linear Bandits

    We study the best-arm identification problem in linear bandits, where the rewards of the arms depend linearly on an unknown parameter $\theta^*$ and the objective is to return the arm with the largest reward. We characterize the complexity of the problem and introduce sample allocation strategies that pull arms to identify the best arm with a fixed confidence, while minimizing the sample budget. In particular, we show the importance of exploiting the global linear structure to improve the estimate of the reward of near-optimal arms. We analyze the proposed strategies and compare their empirical performance. Finally, as a by-product of our analysis, we point out the connection to the $G$-optimality criterion used in optimal experimental design. (In Advances in Neural Information Processing Systems 27 (NIPS), 2014.)
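    As an illustration of the G-optimality connection (a sketch under assumptions, not the allocation strategies analysed in the paper), the code below greedily pulls the arm whose addition most reduces the worst-case prediction variance max_x x^T A^{-1} x, then recommends the arm with the largest estimated reward under the least-squares estimate. Arm set, noise level, and budget are invented for the example.

```python
import numpy as np

def g_allocation_best_arm(arms, theta, n=400, noise=0.1, seed=0):
    """Greedy G-optimal allocation sketch for best-arm identification in a linear bandit."""
    rng = np.random.default_rng(seed)
    arms = np.asarray(arms, dtype=float)
    d = arms.shape[1]
    A = 1e-3 * np.eye(d)               # regularized design matrix sum_t x_t x_t^T
    b = np.zeros(d)

    for _ in range(n):
        # Greedy G-optimal step: pick the pull that minimizes the resulting
        # worst-case prediction variance over all arms.
        best_arm, best_val = 0, np.inf
        for i, x in enumerate(arms):
            A_new_inv = np.linalg.inv(A + np.outer(x, x))
            worst = max(float(y @ A_new_inv @ y) for y in arms)
            if worst < best_val:
                best_arm, best_val = i, worst
        x = arms[best_arm]
        reward = float(x @ theta) + noise * rng.standard_normal()
        A += np.outer(x, x)
        b += reward * x

    theta_hat = np.linalg.solve(A, b)          # least-squares estimate of theta
    return int(np.argmax(arms @ theta_hat))    # recommended arm

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    arms = rng.standard_normal((20, 5))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    theta = rng.standard_normal(5)
    rec = g_allocation_best_arm(arms, theta)
    print("recommended arm:", rec, " true best:", int(np.argmax(arms @ theta)))
```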